Issues and Methodology for Template Design for Information Extraction

Author

  • Boyan A. Onyshkevych
Abstract

The goal of Information Extraction tasks is to identify, categorize, classify, relate, and normalize specific information of interest found in free text, and to make that information available to a back-end database, data fusion, or other application. A data structure referred to as a template is typically used for capturing such information, particularly in cases where the amount and complexity of information is substantial. The design of the template for such an application (or exercise) thus defines the task itself and therefore crucially affects the success of the Information Extraction attempt. This paper discusses template structure and methodological issues that arise in the template design process, within the context of a discussion of the design process itself; it is based on the template design process for TIPSTER/MUC5 and certain subsequent Information Extraction exercises. The first section of this paper addresses the selection of the appropriate data representation (text annotation vs. flat template representation vs. object-oriented template). The second section outlines a set of high-level design considerations (desiderata) that have emerged; these desiderata feed into the discussion of design elements and a procedural review of the design process (design iterations, use of linguistic analysis tools, etc.).

1. Data Structure Selection

Although the selection of an appropriate data structure for representing extracted data may be influenced by the data structure requirements of the back-end application, the use of straightforward deterministic data format converters can decouple those two sets of requirements. Thus a data structure can be selected to be appropriate for the data extraction task itself. The data structures for Information Extraction fall into three broad categories: text annotation, flat data templates, and object-oriented templates.
The appropriateness of these three formats to a particular task depends primarily on the richness of the required data complex. If a task calls for a small number of primitive data types, with no requirement to represent interrelations among them, text annotation may be the simplest representation. This structure can be rendered by tagging delimited text segments with appropriate tags from SGML or another mark-up language (or, equivalently, by an auxiliary file for each document that associates each tag with an offset into that document file). For example, the data from the task of finding company and product names in a text may be most appropriately represented by an annotation scheme. However, if the task also requires the identification of coreferences among names or references in a text, and/or the association of other attributes with those elements, a template structure may be more appropriate. Flat templates, such as those used in MUC3/MUC4, associate related data elements (strings from the text, categorizations of data, or normalized data). Each such template thus represents a data complex of related information; each complex of data from the text results in another template (with the same structure) being instantiated. A flat template's structure is thus a set of slots (each naming an attribute), each with zero, one, or more possible fills (such as strings from the text, numbers, or symbols from a predefined set). The MUC3/MUC4 templates were flat data structures with 24 slots; there was a requirement to represent relationships between data elements in different slots, which led to some awkwardness. For example, in order to correlate the name of a terrorist target with the nationality of that target, a "cross-reference" notation had to be introduced. In response to such difficulties, and because of the richness of the required data complex, the data structure for tasks such as the TIPSTER/MUC5 task is most appropriately object-oriented.
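The two simpler representations described above can be sketched roughly as follows. This is only an illustrative Python sketch, not the MUC format: the class names (`Annotation`, `FlatTemplate`), the helper functions, the tag and slot names, and the sample sentence are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A stand-off tag: a label attached to a character span of the document."""
    tag: str
    start: int  # offset of the first character of the span
    end: int    # offset one past the last character of the span


def annotate(text, phrase, tag):
    """Tag the first occurrence of `phrase` (raises ValueError if it is absent)."""
    start = text.index(phrase)
    return Annotation(tag, start, start + len(phrase))


def to_inline(text, annotations):
    """Render non-overlapping stand-off annotations as inline SGML-style tags."""
    out, pos = [], 0
    for a in sorted(annotations, key=lambda a: a.start):
        out.append(text[pos:a.start])
        out.append(f"<{a.tag}>{text[a.start:a.end]}</{a.tag}>")
        pos = a.end
    out.append(text[pos:])
    return "".join(out)


@dataclass
class FlatTemplate:
    """A flat template: named slots, each holding zero, one, or more fills."""
    slots: dict = field(default_factory=dict)

    def add_fill(self, slot, fill):
        self.slots.setdefault(slot, []).append(fill)

    def fills(self, slot):
        # An unfilled slot simply has no fills.
        return self.slots.get(slot, [])


# Hypothetical sample document and tags.
text = "Acme Corp. announced the RoadRunner 2000 on Monday."
anns = [
    annotate(text, "Acme Corp.", "COMPANY"),
    annotate(text, "RoadRunner 2000", "PRODUCT"),
]
inline = to_inline(text, anns)

# The same facts as slot fills in a flat template (slot names are invented).
template = FlatTemplate()
template.add_fill("COMPANY_NAME", "Acme Corp.")
template.add_fill("PRODUCT_NAME", "RoadRunner 2000")
```

The offset form and the inline form carry the same information, which is why the text treats them as equivalent. Neither the annotations nor the flat slots, however, say which fill in one slot belongs with which fill in another; that is the gap the cross-reference notation had to cover.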
In other words, instead of using one template to capture all the relevant information, there are multiple sub-template types (object types), each representing related information, as well as the relationships to other objects. A completed (or instantiated) template is a set of filled-in objects of different types, representing the relevant information from a particular document. Each object thus captures information about one thing (entity), an event, or an interrelation between other objects. A filled-in template for a particular document may, therefore, have zero, one, or more object instantiations of a given type. A completed template will typically have multiple objects of various types, interconnected by pointers from object to associated object. If there is no information in the document to fill in a given object, that object is not incorporated into the completed template. If a given document is not relevant to the domain, no objects are instantiated (possibly beyond a "header" object which holds the document number, date of analysis, etc.).
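A minimal Python sketch of such an object-oriented template follows. The object types (`Header`, `Entity`, `Event`), their fields, and the sample values are hypothetical and are not the TIPSTER/MUC5 definitions; the sketch only illustrates typed objects linked by pointers, with a header object that exists even for an irrelevant document.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Header:
    """Bookkeeping object present in every template."""
    doc_id: str


@dataclass
class Entity:
    """Information about one thing; related attributes live together."""
    name: str
    nationality: Optional[str] = None


@dataclass
class Event:
    """Information about one event, with a pointer to an associated object."""
    kind: str
    target: Optional[Entity] = None


@dataclass
class Template:
    """A completed template: a header plus zero or more filled-in objects."""
    header: Header
    objects: list = field(default_factory=list)

    def add(self, obj):
        self.objects.append(obj)
        return obj

    def of_type(self, cls):
        return [o for o in self.objects if isinstance(o, cls)]


# A relevant document: one entity and one event that points to it.
relevant = Template(Header("doc-001"))
target = relevant.add(Entity(name="Example Bank", nationality="Exampleland"))
relevant.add(Event(kind="ATTACK", target=target))

# An irrelevant document: only the header object is instantiated.
irrelevant = Template(Header("doc-002"))
```

Because the target's name and nationality sit in the same `Entity` object, the correlation that required a cross-reference notation in a flat template is expressed here directly by the object structure.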





Publication date: 1994